WASSUP? LOL : Characterizing Out-of-Vocabulary Words in Twitter
Authors
Abstract
Language in social media is driven largely by new words and spellings that constantly enter the lexicon, polluting it and causing a high deviation from the formal written version. The primary entities of such language are the out-of-vocabulary (OOV) words. In this paper, we study various sociolinguistic properties of OOV words and propose a classification model that sorts them into at least six categories. We achieve 81.26% accuracy with high precision and recall. We observe that content features are the most discriminative, followed by lexical and context features.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author. Copyright is held by the owner/author(s). CSCW '16 Companion, February 27-March 02, 2016, San Francisco, CA, USA. ACM 978-1-4503-3950-6/16/02. http://dx.doi.org/10.1145/2818052.2869110
Similar resources
Lexical Normalisation of Short Text Messages: Makn Sens a #twitter
Twitter provides access to large volumes of data in real time, but is notoriously noisy, hampering its utility for NLP. In this paper, we target out-of-vocabulary words in short text messages and propose a method for identifying and normalising ill-formed words. Our method uses a classifier to detect ill-formed words, and generates correction candidates based on morphophonemic similarity. Both ...
Crowd Sentiment Detection during Disasters and Crises
Microblogs are an opportunity for scavenging critical information such as sentiments. This information can be used to rapidly detect the sentiment of the crowd towards crises or disasters. It can be used as an effective tool to inform humanitarian efforts, and improve the ways in which informative messages are crafted for the crowd regarding an event. Unique characteristics of microblogs (lack ...
Mining Twitter for New Words
New lexical elements such as LOL are appearing in natural digital language at high frequencies. The usage of these elements suggests that they are being treated like real words. The first step in examining this type of element is to identify them. We gathered 2,798 messages within a 10-mile radius of a specific GPS location for a 10.5 hour period. The novel elements were identified by excluding...
Word Normalization in Twitter Using Finite-state Transducers
This paper presents a linguistic approach based on weighted finite-state transducers for the lexical normalisation of Spanish Twitter messages. The system developed consists of transducers that are applied to out-of-vocabulary tokens. Transducers implement linguistic models of variation that generate sets of candidates according to a lexicon. A statistical language model is used to obtain the m...
Review of Twitter sentiment analysis
Twitter data has recently been used for a large variety of advanced analyses. Analysis of Twitter data imposes new challenges because the data distribution is intrinsically sparse, due to the large number of messages posted every day using a wide vocabulary. The sentiment analysis task is divided into two steps: feature selection and sentiment classification. Feature selec...
Publication date: 2016